Homogeneous Ensemble Learning in Highly Imbalanced Data¶
Data Science is about understanding the data
Name(s): Kaiwen Bian & Bella Wang
Website Link: https://kevinbian107.github.io/ensemble-imbalanced-data/
Content for this Project¶
- Introduction
- Data Cleaning, Transformation, and EDA
- Transformation
- Univariate & Bivariate Analysis
- Aggregated Analysis
- Textual Feature Analysis
- Assessment of Missingness Mechanism
- MAR Analysis
- NMAR Analysis
- Permutation Testing of TF-IDF
- Framing a Predictive Question
- Baseline Model: A Naive Approach
- Handling Missingness in Data
- Train/Val/Test Split
- Feature Engineering
- Final Model: Homogeneous Ensemble Learning
- Feature Engineering (Back to EDA)
- Model Pipeline
- Hyperparameter Tuning
- Evaluation
- Feature Importance
- Confusion Matrix, Evaluation Metrics, and ROC_AUC
- Fairness Analysis
# for eda and modeling
import pandas as pd
import numpy as np
pd.options.plotting.backend = 'plotly'
from utils.dsc80_utils import *
from itertools import chain
Step 1: Introduction¶
A predictive model (classifier) that detects user preference from textual features in combination with other numerical features is the key first step before building a recommender system or doing any further analysis. The challenge addressed in this project is the highly imbalanced nature of the recipe dataset that we are using.
Random Forest Algorithm¶
In this project, we adapt ideas from homogeneous ensemble learning: we train multiple Decision Trees and combine them into a Random Forest for more robust predictions on the data.
A Decision Tree essentially learns to ask questions about the data in a high-dimensional space (one dimension per feature), separating the data with "boxes" or "lines" in that space. The core mechanism that makes this work is entropy minimization: the model chooses each split to reduce the entropy, or uncertainty, of the resulting partitions, so that one category falls mostly on one side of the split and the other category on the other side.
\begin{align} \text{entropy} &= - \sum_C p_C \log_2 p_C \end{align}
A Random Forest is built by training many decision trees; at each split, a random subset of $m$ features is considered instead of all $d$ of them, and the tree splits on the best feature within that subset. In practice $m = \sqrt{d}$ tends to work well, and it is also the default that scikit-learn uses. This lets each decision tree come up with different prediction rules for a later vote on the best prediction.
- Notice that we are not doing a simple bootstrap of the rows, since the resulting trees may not differ enough from one another; instead, we inject diversity by sampling different feature subsets while using the same type of model (the decision tree), making this a homogeneous ensemble learning method.
- We want the individual predictors to have low bias, high variance, and be uncorrelated with each other. When we average them (take votes across them), the bias stays low while the variance of the ensemble shrinks.
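To make this concrete, here is a minimal, self-contained sketch (the synthetic data below is an assumption for illustration, not the recipe dataset): the entropy formula above, plus a scikit-learn `RandomForestClassifier` using the $\sqrt{d}$ feature-subsampling default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def entropy(p):
    """Entropy of a class distribution p (a sequence of probabilities)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

# A 50/50 split is maximally uncertain; a pure node has zero entropy.
print(entropy([0.5, 0.5]))            # 1.0
print(entropy([1.0]))                 # 0.0

# Synthetic, highly imbalanced binary data (roughly 90% / 10%)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Each tree sees a bootstrap sample of rows and, at every split, only a
# random subset of sqrt(d) features, which decorrelates the trees.
rf = RandomForestClassifier(n_estimators=100, criterion='entropy',
                            max_features='sqrt', random_state=42)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))
```

Note that `max_features='sqrt'` is already the scikit-learn default for classification; it is spelled out here only to highlight the feature-subsampling mechanism.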

Step 2: Data Cleaning and Exploratory Data Analysis¶
interactions = pd.read_csv('food_data/RAW_interactions.csv')
recipes = pd.read_csv('food_data/RAW_recipes.csv')
Merging¶
An initial merge is needed to combine the two datasets into one big dataset.
- Left merge the recipes and interactions datasets together.
- In the merged dataset, fill all ratings of 0 with np.nan. (A rating of 0 generally means the reviewer left a review without giving a rating, rather than actually rating the recipe 0 out of 5, so treating it as missing avoids biasing the average ratings downward.)
- Find the average rating per recipe, as a Series.
- Add this Series containing the average rating per recipe back to the recipes dataset however you’d like (e.g., by merging). Use the resulting dataset for all of your analysis. (For the purposes of Project 4, the 'review' column in the interactions dataset doesn’t have much use.)
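As a tiny illustration of why replacing 0 with np.nan matters (the toy numbers below are invented): pandas skips missing values when averaging, so rows where no rating was actually given no longer drag the mean down.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'recipe_id': [1, 1, 1], 'rating': [5, 4, 0]})

# Treating 0 as a real score drags the average down ...
naive_mean = toy['rating'].mean()            # (5 + 4 + 0) / 3 = 3.0

# ... whereas treating 0 as "no rating given" skips it entirely.
toy['rating'] = toy['rating'].replace(0, np.nan)
missing_mean = toy['rating'].mean()          # (5 + 4) / 2 = 4.5
print(naive_mean, missing_mean)
```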
Transformation¶
- Some columns, like `nutrition`, contain values that look like lists but are actually strings that look like lists. We turned the strings into actual columns, one for each value in those lists.
- Convert `steps`, `ingredients`, and `tags` to lists.
- Convert `date` and `submitted` to Timestamp objects and rename them `review_date` and `recipe_date`.
- Convert types.
- Drop `id` (same as `recipe_id`).
- Replace 'nan' with np.NaN.
Type Logic¶
- `String: [name, contributor_id, user_id, recipe_id]` - the ids look numeric but cannot meaningfully support mathematical operations, so they are kept as strings (qualitative discrete)
  - `name` is the name of the recipe
  - `contributor_id` is the author id of the recipe (7157 unique values)
  - `recipe_id` is the id of the recipe (25287 unique values)
  - `id` from the original dataframe is also the id of the recipe, dropped after merging
  - `user_id` is the id of the reviewer (8402 unique values)
- `List: [tags, steps, description, ingredients, review]` - qualitative, no mathematical operations (qualitative discrete)
- `int: [n_steps, minutes, n_ingredients, rating]` - quantitative, mathematical operations allowed (quantitative discrete)
- `float: [avg_rating, calories, total_fat, sugar, sodium, protein, sat_fat, carbs]` - quantitative, mathematical operations allowed (quantitative continuous)
- `Timestamp: [recipe_date, review_date]` - quantitative, mathematical operations allowed (quantitative continuous)
Below is the full implementation: `initial`, which performs the merge-related cleaning, and `transform_df`, which carries out the necessary transformations described above.
def initial(df):
    '''Initial cleaning of the merged df: treat 0 ratings as missing and add per-recipe average ratings'''
    # a rating of 0 means no rating was given, so treat it as missing
    df['rating'] = df['rating'].apply(lambda x: np.NaN if x == 0 else x)
    # recipe_id is not unique, so aggregate to get the average rating per recipe
    avg = df.groupby('recipe_id')[['rating']].mean().rename(columns={'rating': 'avg_rating'})
    df = df.merge(avg, how='left', left_on='recipe_id', right_index=True)
    return df
def transform_df(df):
    '''Transform nutrition into its own columns,
    convert tags, steps, ingredients to lists,
    convert submission dates to Timestamp objects,
    convert types,
    and replace 'nan' strings with np.NaN'''
    # split the nutrition string into one column per value
    data = df['nutrition'].str.strip('[]').str.split(',').to_list()
    name = {0: 'calories', 1: 'total_fat', 2: 'sugar', 3: 'sodium',
            4: 'protein', 5: 'sat_fat', 6: 'carbs'}
    new = pd.DataFrame(data).rename(columns=name)
    df = df.merge(new, how='inner', right_index=True, left_index=True)
    df = df.drop(columns=['nutrition'])

    # convert the list-like strings to actual lists
    def convert_to_list(text):
        return text.strip('[]').replace("'", '').split(', ')

    df['tags'] = df['tags'].apply(convert_to_list)
    df['ingredients'] = df['ingredients'].apply(convert_to_list)
    # some steps are long sentences without quotes; whitespace may still need handling
    df['steps'] = df['steps'].apply(convert_to_list)

    # submission date to Timestamp object
    fmt = '%Y-%m-%d'
    df['submitted'] = pd.to_datetime(df['submitted'], format=fmt)
    df['date'] = pd.to_datetime(df['date'], format=fmt)

    # drop the duplicate id column and rename the date columns
    df = df.drop(columns=['id']).rename(columns={'submitted': 'recipe_date', 'date': 'review_date'})

    # convert data types
    nutrition_cols = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'sat_fat', 'carbs']
    df[nutrition_cols] = df[nutrition_cols].astype(float)
    df[['user_id', 'recipe_id', 'contributor_id']] = df[['user_id', 'recipe_id', 'contributor_id']].astype(str)

    # replace literal 'nan' strings with real missing values
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].apply(lambda x: np.NaN if x == 'nan' else x)
    return df
merged = recipes.merge(interactions, how='left', left_on='id', right_on='recipe_id')
cleaned = (merged
.pipe(initial)
.pipe(transform_df))
display_df(cleaned)
| | name | minutes | contributor_id | recipe_date | ... | sodium | protein | sat_fat | carbs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 brownies in the world best ever | 40 | 985201 | 2008-10-27 | ... | 3.0 | 3.0 | 19.0 | 6.0 |
| 1 | 1 in canada chocolate chip cookies | 45 | 1848091 | 2011-04-11 | ... | 22.0 | 13.0 | 51.0 | 26.0 |
| 2 | 412 broccoli casserole | 40 | 50969 | 2008-05-30 | ... | 32.0 | 22.0 | 36.0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 234426 | cookies by design sugar shortbread cookies | 20 | 506822 | 2008-04-15 | ... | 4.0 | 4.0 | 11.0 | 6.0 |
| 234427 | cookies by design sugar shortbread cookies | 20 | 506822 | 2008-04-15 | ... | 4.0 | 4.0 | 11.0 | 6.0 |
| 234428 | cookies by design sugar shortbread cookies | 20 | 506822 | 2008-04-15 | ... | 4.0 | 4.0 | 11.0 | 6.0 |
234429 rows × 23 columns
The code below will be used later when we need to group by the recipe_id column or the user_id column for different purposes. The handling of each column is defined below, and differs according to what the columns are needed for later in the modeling process.
def group_recipe(df):
    '''Group by recipe_id: average numeric columns, keep the first text fields, and collect the rest into lists'''
    func = lambda x: list(x)
    check_dict = {'minutes': 'mean', 'n_steps': 'mean', 'n_ingredients': 'mean',
                  'avg_rating': 'mean', 'rating': 'mean', 'calories': 'mean',
                  'total_fat': 'mean', 'sugar': 'mean', 'sodium': 'mean',
                  'protein': 'mean', 'sat_fat': 'mean', 'carbs': 'mean',
                  'steps': 'first', 'name': 'first', 'description': 'first',
                  'ingredients': func, 'user_id': func, 'contributor_id': func,
                  'review_date': func, 'review': func, 'recipe_date': func,
                  'tags': lambda x: list(chain.from_iterable(x))}
    grouped = df.groupby('recipe_id').agg(check_dict)
    return grouped
def group_user(df):
    '''Group by unique user_id, concatenating all steps/names/tags of recipes and averaging the ratings given'''
    # the agg dict already selects the columns, so no separate column indexing is needed
    return (df
            .groupby('user_id')
            .agg({'steps': lambda x: list(chain.from_iterable(x)),
                  'name': lambda x: list(x),
                  'tags': lambda x: list(chain.from_iterable(x)),
                  'rating': 'mean',
                  'minutes': 'mean',
                  'calories': 'mean',
                  'description': lambda x: list(x),
                  'n_ingredients': 'mean',
                  'n_steps': 'mean',
                  'ingredients': lambda x: list(chain.from_iterable(x)),
                  'contributor_id': lambda x: list(x),
                  'review': lambda x: list(x),
                  })
            )
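To show the aggregation pattern these helpers rely on, here is a minimal sketch on an invented toy frame: numeric columns are averaged per group, while list-valued columns are flattened with `chain.from_iterable`.

```python
from itertools import chain

import pandas as pd

toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b'],
    'rating': [5, 3, 4],
    'tags': [['easy', 'vegan'], ['quick'], ['easy']],
})

# mean for numeric columns, flattened lists for list-valued columns
out = toy.groupby('user_id').agg({
    'rating': 'mean',
    'tags': lambda x: list(chain.from_iterable(x)),
})
print(out)
```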
Univariate & Bivariate Analysis¶
Okay, after data cleaning, let's draw some graphs to see what kind of data we are dealing with.
px.violin(cleaned, x=['sodium','calories','minutes'])